\section{Conclusion}
\label{sec:conclusion}

\begin{center}
{\it Use a vendor-tuned BLAS or ATLAS unless you love pain.}
\end{center}

We have presented several optimizations for the multiplication of square matrices that aim to increase execution speed by exploiting the machine's memory hierarchy and processor pipelining.  We implemented several of these optimizations, in particular blocking, loop unrolling, and SSE instructions.  We evaluated them on an AMD-based machine on the NERSC Franklin cluster and on an Intel-based machine with two different compilers, and provided a simple analysis and discussion of the results.